Data Deduplication in Parallel Mining of Frequent Item sets using MapReduce
ثبت نشده
چکیده
A Parallel Frequent Item sets mining algorithm called FiDoop using MapReduce programming model. FiDoop includes the frequent items ultrametric tree(FIU-tree), in that three MapReduce jobs are applied to complete the mining task. The scalability problem has been addressed bythe implementation of a handful of FP-growth-like parallelFIM algorithms. InFiDoop, the mappers independently and concurrently decompose item sets; the reducers perform combination operationsby constructing small ultrametric trees as well as miningthese trees in parallel. Data Deduplication is one of important data compression method for erasing duplicate copies of repeating data and reduce the amount of storage space and save bandwidth.The technique is used to improve storage space utilization and can also be applied to reduce the number of bytes. The first MapReduce job discovers all frequent items, the second MapReduce job scans the database to generate k-item sets by removing infrequent items, and the third MapReduce job complicated one to constructs k-FIU-tree and mines all frequent k-item sets. In this paper, we applying Deduplication technique in third MapReduce job to avoid the replication of data in frequent item sets and improve the performance. It produces highly related mining results with less time and increase the storage capacity. Hadoop supports nine different tools, while Mahout is based on core algorithm and classifications. Having sequence algorithm to produce the output in better way. We aim to implement recommendation algorithm using Mahout, a machine learning device, on Hadoop platform to provide a scalable system for processing large data sets efficiently. This can be performed on such platforms for quicker performance.
منابع مشابه
An Improved Technique Of Extracting Frequent Itemsets From Massive Data Using MapReduce
The mining of frequent itemsets is a basic and essential work in many data mining applications. Frequent itemsets extraction with frequent pattern and rules boosts the applications like Association rule mining, co-relations also in product sale and marketing. In extraction process of frequent itemsets there are number of algorithms used Like FP-growth,E-clat etc. But unfortunately these algorit...
متن کاملParallel Rule Mining with Dynamic Data Distribution under Heterogeneous Cluster Environment
Big data mining methods supports knowledge discovery on high scalable, high volume and high velocity data elements. The cloud computing environment provides computational and storage resources for the big data mining process. Hadoop is a widely used parallel and distributed computing platform for big data analysis and manages the homogeneous and heterogeneous computing models. The MapReduce fra...
متن کاملWeighted Itemset Mining from Bigdata using Hadoop
Data items have been extracted using an empirical data mining technique called frequent itemset mining. In majority of theapplication contexts items are enriched with weights. Pushing an item weights into the itemset extraction process, i.e., mining weighted itemsets rather than traditional itemsets, is an appealing research direction. Although many efficient weighteditemset mining algorithms a...
متن کاملPerformance Evaluation of Apriori Algorithm on a Hadoop Cluster
Frequent Itemset Mining is a well-known concept in data sciences. If we feed frequent itemset miner algorithms with large datasets they become resource hungry fast as their search space explodes. This problem is even more apparent when we try to use them on Big Data. Recent advances in parallel programming provides good solutions to deal with large datasets but they present their own problems w...
متن کاملMining Frequent Item Sets Using Map Reduce Paradigm
In Text categorization techniques like Text classification or clustering, finding frequent item sets is an acquainted method in the current research trends. Even though finding frequent item sets using Apriori algorithm is a widespread method, later DHP, partitioning, sampling, DIC, Eclat, FP-growth, H-mine algorithms were shown better performance than Apriori in standalone systems. In real sce...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2016